ggplot2 tutorial

<<< Back to Index

Load the data

You first need to load the data.

Follow the procedure using the interface to load the file or try to use the function read.table.

Clinical data: clinical_data.csv, name the variable clinical

Code
# or you can try to adapt this code
clinical <- read.table("datasets/clinical_data.csv",sep=",",head=T)
head(clinical)
  patient_id      age gender smoking_status   height   weight
1          1 64.96714   Male         Smoker 180.9910 98.83183
2          2 58.61736   Male     Non-Smoker 162.6784 53.58433
3          3 66.47689   Male     Non-Smoker 176.1161 76.92134
4          4 75.23030   Male     Non-Smoker 165.7124 69.74749
5          5 57.65847   Male         Smoker 161.1106 70.78215
6          6 57.65863   Male     Non-Smoker 179.1095 68.22497
  blood_pressure_systolic blood_pressure_diastolic cholesterol   glucose
1               122.50182                 78.18016    203.4560  67.06891
2               109.61795                 75.56966    194.7213 103.10724
3               113.57004                 91.31688    160.4247 119.15123
4               110.89187                107.75532    145.2192 101.06535
5                98.80158                 81.78133    138.7112  89.83955
6               136.26337                 83.63899    214.1575 194.89959
  heart_rate hospital_visits_last_year     diagnosis treatment_group
1   60.04614                         2 Heart Disease         Control
2   88.67212                         6 Heart Disease          Drug B
3   83.17836                         4    No Disease         Control
4   79.68278                         2      Diabetes          Drug A
5   57.38393                         5  Hypertension          Drug A
6   96.59123                         0      Diabetes         Control
  treatment_response
1          No Change
2          No Change
3          No Change
4          No Change
5           Worsened
6          No Change

What is a plot ?

A graphical plot gives us a visual summary of total (or some parts) of a dataset. It is composed of several elements :

  • Data : It is what we want to represent, it is the heart of the plot.
  • Axis : They allow you to set visualization limits and understand data scale.
  • Labels : This allows you to give names to the axes, so you know what they represent. It is also possible to label data directly on the plot.
  • Shape : Depending on the data we have and we want to have as representation, we’ll obtain a geometric shape that can range from a simple point, through a line, to more complex forms.

How ggplot2 works ?

The “gg” in ggplot2 stands for Grammar of Graphics. So we’re going to use some special verbs to help us build our final graph.

ggplot2 basically works like a cake with several layers.

Each layer has a role to play in the final rendering. As with a real cake, it’s not necessary to use every available layer to make a good plot. The more information, colors or labels a plot contains, the more likely it is to become illegible. You need to find the right balance. (See in slides into the ggplot2 section)

Install ggplot2 and load the library

Code
install.packages("ggplot2")
library(ggplot2)

Display your first plot step by step

Before unfolding the code and answers boxes, try to find the solution from what we have seen in the lecture or by searching on the web, it should become an automatism, no one knows everything !

This is what our plot will look like. Now, let’s break it down layer by layer and build it.

  1. Use the function ggplot() alone.
Code
ggplot()

A default grey background is displayed.

  1. Display the axis: Add the data and choose the x and y axis from the column names that you can display using colnames. For instance, we want to evaluate the distribution of the Glucose concentration (y-axis) in function of the diagnosis (x-axis).
Code
# Display the column names
colnames(clinical)
 [1] "patient_id"                "age"                      
 [3] "gender"                    "smoking_status"           
 [5] "height"                    "weight"                   
 [7] "blood_pressure_systolic"   "blood_pressure_diastolic" 
 [9] "cholesterol"               "glucose"                  
[11] "heart_rate"                "hospital_visits_last_year"
[13] "diagnosis"                 "treatment_group"          
[15] "treatment_response"       
Code
ggplot(data = clinical, mapping = aes(x=diagnosis, y=glucose))

Code
# Same code as above but simplified, without the name of arguments
ggplot(clinical, aes(diagnosis, glucose))

On this line, we tell ggplot to map diagnosis on the x-axis and glucose on the y-axis. Note that the axis are automatically labelled with the column names.

The default theme displays a grey background and a white grid. We will see that is easy to change this features.

  1. Display the geometric shape: boxplot

Add a layer to the plot. Which is the function to use ?

Key words to write on your favorite browser
  • R ggplot2 boxplot
Code
ggplot(data = clinical, mapping = aes(x=diagnosis, y=glucose)) +
  geom_boxplot()

Customize by adding layers and modifying aesthetics

  1. Change the x-axis labels: we want to make it prettier: e.g Type of Disease and the y-axis to make it more explicit: e.g Glucose (mg/dL)
Code
ggplot(data = clinical, mapping = aes(x=diagnosis, y=glucose)) +
  geom_boxplot() + 
  labs(x="Type of Disease", y="Glucose (mg/dL)")

  1. Change the global theme to have a white background more suitable for figures and projection
Key words to write on your favorite browser
  • R ggplot2 theme white background
Code
ggplot(data = clinical, mapping = aes(x=diagnosis, y=glucose)) +
  geom_boxplot() + 
  labs(x="Type of Disease", y="Glucose (mg/dL)") +
  theme_bw()

  1. Change the color and fill of the boxplots
  • All the boxplots in red and filled with grey.
Code
ggplot(data = clinical, mapping = aes(x=diagnosis, y=glucose)) +
  geom_boxplot(color="red", fill="grey") + 
  labs(x="Type of Disease", y="Glucose (mg/dL)") +
  theme_bw()

You can play globally on the color of the borders (parameter color), the filling (parameter fill), the transparency (parameter alpha)) and more. The accessible aesthetics can vary according to the geometric shape, form. instance for lines you will have the width and the type.

  1. Change the color of the boxplots in function of the type of disease
  • During the previous step the same colors have been applied to all boxplots, the goal here is to use the information contained in a column of our dataset (diagnosis). This is part of the aesthetics. Based on the current code, how would you add this information (2 possibilities) ?

Option 1

Code
# Directly in the aesthetics of the ggplot function that will be accessible for all layers
ggplot(data = clinical, mapping = aes(x=diagnosis, y=glucose, color=diagnosis)) + # Add the new aesthetics color
  geom_boxplot() + # remove the global colors
  labs(x="Type of Disease", y="Glucose (mg/dL)") +
  theme_bw()

Option 2

Code
# Within the geom_boxplot() function which will define the aesthetics 
# with aes() only for this layer
ggplot(data = clinical, mapping = aes(x=diagnosis, y=glucose)) +
  geom_boxplot(aes(color=diagnosis)) + 
  labs(x="Type of Disease", y="Glucose (mg/dL)") +
  theme_bw()

By default the colors are not very contrasted … you can customize them by :

  1. Defining your own palette optionally by attributing a color to each category ;

  2. Find a predefined palette to help you to define the colors. To do this task, you will need to add a new layer by using a function that help in scaling the colors manually.

i) You can search for how colors are named in R if you want to try your favorite colors. Many palettes are available through dedicated packages. (My favorite function to call for colors is colours() systematically installed and loaded with R, you can try some of my preferred ones: colours()[124], colours()[613], colours()[53], colours()[144]).

Key words to write on your favorite browser
  • R ggplot2 colors
Code
# the fonction colours() returns the name of the color
colours()[124]
[1] "deepskyblue3"
Code
# use the function c() to concat several elements, we call it a vector
c(colours()[613], colours()[53], colours()[144]) 
[1] "springgreen3" "chocolate1"   "gold2"       
  • R ggplot2 boxplot color

To define your own palette, you need to concatenate as many colors as the number of categories, here 4 colors. You can assign the colors to the categories by building a named vector.

Code
# We use the function c(), to name the elements of a vector you use the name you want
# and you assign the values with the sign =
palette.disease <- c("Diabetes"=colours()[144],
                      "Heart Disease"=colours()[34],
                      "Hypertension"=colours()[613],
                      "No Disease"=colours()[24])
# Display the content
palette.disease
      Diabetes  Heart Disease   Hypertension     No Disease 
       "gold2"       "brown2" "springgreen3"        "black" 

Now you have your colors, add the layer to the ggplot object and use the variable palette.disease you just created.

Key words to write on your favorite browser
  • R ggplot2 scaling color manually
Code
ggplot(data = clinical, mapping = aes(x=diagnosis, y=glucose)) +
  geom_boxplot(aes(color=diagnosis)) + 
  labs(x="Type of Disease", y="Glucose (mg/dL)") +
  theme_bw() +
  scale_colour_manual(values=palette.disease)

ii) R natively provides a few continuous color palettes which means that you do not need to install and load any library. Also when you install ggplot2, it installs some dependencies (packages that ggplot2 needs to work properly), among them it installs the packages RColorBrewer, viridis, paletteer, khroma etc. and have implemented specific functions to use the palettes, it means that those packages are installed but not loaded. You can see which functions are available when typing scale_color_ on RStudio. If you want to use the functions from these packages you need to load the package but it is not mandatory. Finally, a lot more palettes are available through packages.

Native palettes

Code
# rainbow, heat.colors, terrain.colors, topo.colors, cm.colors
ggplot(data = clinical, aes(x=diagnosis, y=glucose)) +
  geom_boxplot(aes(color=diagnosis)) + 
  labs(x="Type of Disease", y="Glucose (mg/dL)") +
  theme_bw() +
  scale_colour_manual(values=terrain.colors(4))

Viridis palettes

Code
# options: viridis, magma, plasma, inferno, cividis
ggplot(data = clinical, aes(x=diagnosis, y=glucose)) +
  geom_boxplot(aes(color=diagnosis)) + 
  labs(x="Type of Disease", y="Glucose (mg/dL)") +
  theme_bw() +
  scale_colour_viridis_d(option="inferno")

  1. Customize theme elements by rotating the x-axis labels by 45°
Key words to write on your favorite browser
  • R ggplot2 rotate x axis labels
Code
ggplot(data = clinical, aes(x=diagnosis, y=glucose)) +
  geom_boxplot(aes(color=diagnosis)) + 
  labs(x="Type of Disease", y="Glucose (mg/dL)") +
  theme_bw() +
  scale_colour_manual(values=palette.disease) +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))

  1. Flip the plot to get horizontal boxplots

There is two ways, the first one is very simple and consists in inverting x and y aesthetics, the second one involves the addition of a new layer to flip the axis (preferred option and it provides more flexibility). Lets keep our own color palette palette.disease

Option 1

Code
ggplot(data = clinical, aes(y=diagnosis, x=glucose)) +
  geom_boxplot(aes(color=diagnosis)) + 
  labs(x="Type of Disease", y="Glucose (mg/dL)") +
  theme_bw() +
  scale_colour_manual(values=palette.disease)

Option 2

Key words to write on your favorite browser
  • R ggplot2 flip axis
Code
ggplot(data = clinical, aes(x=diagnosis, y=glucose)) +
  geom_boxplot(aes(color=diagnosis)) + 
  labs(x="Type of Disease", y="Glucose (mg/dL)") +
  theme_bw() +
  scale_colour_manual(values=palette.disease) +
  coord_flip()

  1. Add another geometric shape: points

This task consist of adding a new layer containing the points colored by known oncogenes. Which function should we use ? Note that there are two answers but one is more suitable for boxplot as it is adding some random noise to the positions in order to avoid overlaps. Set the size of the points to 0.3

Key words to write on your favorite browser
  • R ggplot2 point
  • R ggplot2 point boxplot

Option 1

Code
ggplot(data = clinical, aes(x=diagnosis, y=glucose)) +
  geom_boxplot(aes(color=diagnosis)) + 
  labs(x="Type of Disease", y="Glucose (mg/dL)") +
  theme_bw() +
  scale_colour_manual(values=palette.disease) +
  coord_flip() +
  geom_point(aes(color=diagnosis), size=0.3)

Option 2

Code
ggplot(data = clinical, aes(x=diagnosis, y=glucose)) +
  geom_boxplot(aes(color=diagnosis)) + 
  labs(x="Type of Disease", y="Glucose (mg/dL)") +
  theme_bw() +
  scale_colour_manual(values=palette.disease) +
  coord_flip() +
  geom_jitter(aes(color=diagnosis), size=0.3)

  1. The boxplots are no more visible. Find a way to make them visible. Play with opacity both in boxplots and points.
Tips

We use layers that we stack one above the other.

Code
ggplot(data = clinical, aes(x=diagnosis, y=glucose)) +
    geom_jitter(aes(color=diagnosis), size=0.3) +
    geom_boxplot(aes(color=diagnosis),alpha=0.7) + 
    labs(x="Type of Disease", y="Glucose (mg/dL)") +
    theme_bw() +
    scale_colour_manual(values=palette.disease) +
    coord_flip() 

Create a density plot

Your goal is to produce the following density plot showing the density distributions of the glucose concentration in function of the diseases (use palette.disease). Here we introduce the customization of the plot dimensions, we want to force the plot to have be x-axis 2 times longer than the y-axis which we call the aspect ratio. Thanks to it, you can force to have squared plots for instance.

  1. Identify what contain the axis.
Answer

The x-axis contains the value of the column glucose and the y-axis indicates the density of the distribution, the y-axis do not need to be specified.

  1. Make a list of the aesthetics you need to modify and layers that need to be added.
Answer
  • Layers : the geometric shape, label of the x-axis, theme background, theme aspect ratio, custom color palette

  • Aesthetics : x-axis , color of the lines

  1. Write your code ! (Do not forget your best friend: the Web)
Code
ggplot(data = clinical, aes(x=glucose)) +
    geom_density(aes(color=diagnosis)) + 
    labs(x="Glucose (mg/dL)") +
    theme_bw() +
    scale_colour_manual(values=palette.disease) +
    theme(aspect.ratio = 1/2)

Create a bar plot

Your goal is to produce the following bar plot showing the number of patient that are smoking (column smoking_status - play with the transparency) in each outcome. In this case the aspect ratio is inverse compared to the previous one.

  1. Identify what contain the axis.
Answer

The x-axis indicates the number of sample falling in the different outcomes, the x-axis do not need to be specified. The y-axis indicates the different outcomes. As a reminder, one would prefer to flip the coordinates but it is not mandatory.

  1. Make a list of the aesthetics you need to modify and layers that need to be added.
Answer
  • Layers: the geometric shape, label of the y-axis, theme background, theme aspect ratio, custom color palette

  • Aesthetics: y-axis , color of the lines, fill of the bars, transparency according to gender

  1. Write your code ! (Do not forget your best friend: the Web)
Code
ggplot(data = clinical, aes(x=treatment_response)) +
  geom_bar(aes(fill=treatment_response, color=treatment_response,alpha=smoking_status)) + 
  labs(x="Outcomes") +
  theme_bw() +
  scale_fill_manual(values=c(Worsened="red","No Change"="darkgrey",Improved=colours()[613])) +
  scale_colour_manual(values=c(Worsened="red","No Change"="darkgrey",Improved=colours()[613])) +
  coord_flip() +
  theme(aspect.ratio = 2)

BONUS 1 : What happens if you set the parameter of the argument position in the geometric layer to fill ?

Answer

The y-axis shows now proportions instead of absolute values. It highlights a higher proportion of smokers in the group of worsened outcomes.

Code
ggplot(data = clinical, aes(x=treatment_response)) +
  geom_bar(aes(fill=treatment_response, color=treatment_response,alpha=smoking_status), position="fill") + 
  labs(x="Outcomes") +
  theme_bw() +
  scale_fill_manual(values=c(Worsened="red","No Change"="darkgrey",Improved=colours()[613])) +
  scale_colour_manual(values=c(Worsened="red","No Change"="darkgrey",Improved=colours()[613])) +
  coord_flip() +
  theme(aspect.ratio = 2)
Warning: Using alpha for a discrete variable is not advised.

Changing layout of the data

You are working on your dataset. You are happy because you have brilliantly succeed on doing plots from your data.

But there’s one question on your mind (maybe) : Does the smoking status seem to be associated with other factors ?

The answer to this question comes quickly. “We could plot the proportion of smokers in each category !”

Option 1 (not recommended)

You can plot independently the four barplots. It’s a possibility, after all we only have 4 factors to compare. But this method becomes more time-consuming and laborious when several dozen factors are involved.

Code
ggplot(data = clinical, aes(x=treatment_response)) +
  geom_bar(aes(fill=treatment_response, color=treatment_response,alpha=smoking_status), position="fill") + 
  labs(x="Outcomes") +
  theme_bw() +
  scale_fill_manual(values=c(Worsened="red","No Change"="darkgrey",Improved=colours()[613])) +
  scale_colour_manual(values=c(Worsened="red","No Change"="darkgrey",Improved=colours()[613])) +
  theme(aspect.ratio = 1)

ggplot(data = clinical, aes(x=gender)) +
  geom_bar(aes(fill=gender, color=gender,alpha=smoking_status), position="fill") + 
  labs(x="Patient Gender") +
  theme_bw() +
  scale_fill_manual(values=c(Male="red",Female=colours()[613])) +
  scale_colour_manual(values=c(Male="red",Female=colours()[613])) +
  theme(aspect.ratio = 1)

ggplot(data = clinical, aes(x=diagnosis)) +
  geom_bar(aes(fill=diagnosis, color=diagnosis,alpha=smoking_status), position="fill") + 
  labs(x="Type of Disease") +
  theme_bw() +
  scale_fill_manual(values=palette.disease) +
  scale_colour_manual(values=palette.disease) +
  theme(aspect.ratio = 1)

ggplot(data = clinical, aes(x=treatment_group)) +
  geom_bar(aes(fill=treatment_group, color=treatment_group,alpha=smoking_status), position="fill") + 
  labs(x="Treatment group") +
  theme_bw() +
  scale_fill_manual(values=c(Control="red","Drug B"="darkgrey","Drug A"=colours()[613])) +
  scale_colour_manual(values=c(Control="red","Drug B"="darkgrey","Drug A"=colours()[613])) +
  theme(aspect.ratio = 1)

Option 2 (recommended)

Our goal is to obtain a table in which we have all combinations of smoking habits and other features for each patient, in other words we want a line per feature and its measure in each patient and its smoking habit. The column containing the features (gender, diagnosis, treatment_group, treatment_response) will be use to split the plot into 4 panels, one per feature. This is called faceting.

i) We will first extract a data.frame containing the qualitative values and store it in a variable named qual.features. To do that use the functions provided by dplyr.

Tips

You need to use a combination of functions that allows to select columns according to a type of variable. You can find the answer in the dplyr tutorial or on the web helped or not by an artificial intelligence.

Code
qual.features <- clinical %>% select(!where(is.numeric))
head(qual.features)
  gender smoking_status     diagnosis treatment_group treatment_response
1   Male         Smoker Heart Disease         Control          No Change
2   Male     Non-Smoker Heart Disease          Drug B          No Change
3   Male     Non-Smoker    No Disease         Control          No Change
4   Male     Non-Smoker      Diabetes          Drug A          No Change
5   Male         Smoker  Hypertension          Drug A           Worsened
6   Male     Non-Smoker      Diabetes         Control          No Change

ii) We will “lengthens” the dataset qual.features so that we obtain a longer data.frame (named qual.features.long) where you will have the measures corresponding the smoking status and all other measures. The smoking_status column should be kept in the final table, the other features should be stored in a new column named features and the values stored in a column named values.

       smoking_status           features        values
               <char>             <char>        <char>
    1:         Smoker             gender          Male
    2:         Smoker          diagnosis Heart Disease
    3:         Smoker    treatment_group       Control
    4:         Smoker treatment_response     No Change
    5:     Non-Smoker             gender          Male
   ---                                                
39996:     Non-Smoker treatment_response     No Change
39997:     Non-Smoker             gender        Female
39998:     Non-Smoker          diagnosis Heart Disease
39999:     Non-Smoker    treatment_group       Control
40000:     Non-Smoker treatment_response     No Change

=> Modify dataset layout with pivot_longer() and pivot_wider()

We can change the layout of a dataset. It means, for example, transform columns into rows or the invert. This is a common task in data analysis, particularly when you need to organize data in a structure that is best suited for your analysis.

In R, this often involves transforming data between long and wide format. In a wide format, each row represent a single observation, and columns represent variables (like in qual.features). In a long format, each row is a single measurement, with a separate column for the variables and the values. This structure is particularly useful for certain graphical representations.

To switch from one shape to another, there are two functions called pivot_longer() and pivot_wider(). pivot_longer() function enables to convert multiple columns into key-value pairs, where column names become variable names, and their corresponding values are stacked. pivot_wider() function allows to spread key-value pairs across multiple columns. We say that pivot_longer() reshapes wide format to long format and pivot_wider() do the invert.

Schematic view of wide and long formats
long_tab <- wide_tab %>% 
   pivot_longer(!smoking_status,
                    "features", 
                    "values")

wide_tab <- long_tab %>% 
    pivot_wider("features", 
                    "values")

Try to play the pivot_longer() command as it is (don’t forget to use our data) … you obtain an error:

Error in pivot_longer():

! Arguments in ... must be used.

✖ Problematic arguments:

• ..1 = “features”

• ..2 = “values”

ℹ Did you misspell an argument name?

Run rlang::last_trace() to see where the error occurred.

Explore the pivot_longer() functions and identify the names of the parameters that are set in the figure example. Describe what is happening.

Answer

Search on the web or look at the help of each function.

?pivot_longer

In R, when the names of the parameters are not indicated, the order of the arguments is very important. It will correspond to the order of the arguments defined in the function.

The first parameter data is automatically filled by the stream %>% and take the data frame qual.features.

The second parameter cols do not have default argument and will take the first argument specified in the command: !smoking_status.

This argument indicates the column that should NOT be spread (the logical NO ! makes the negation), indeed we want to keep this column in the resulting table as it will be use as a layer in our plot. If the ! was not present than the specified columns are spread.

The third parameter ... is specifically designed for additional column selectors, this means you can use multiple column selection methods or helpers (like starts_with(), contains(), etc.) to specify the columns to be pivoted, but most of the time you can achieve your goal with just cols, as the need for ... is rare in practice for this function. The argument "features" and "values" in ... are not being correctly understood by the function. No such functions exist. It is what the error says:

! Arguments in ... must be used.

✖ Problematic arguments:

• ..1 = “features”

• ..2 = “values”

From here you need to find the appropriate arguments to set the name of the column that will contain the column names to features and set the name of the column that will contain the values corresponding to the pairs smoking_status-features to values.

Here it is the answer:

Code
qual.features.long <- qual.features %>%
  pivot_longer(
    # Exclude smoking_status, pivot other columns
    cols = !smoking_status,    
    # A character vector specifying the new column or columns to create from 
    # the information stored in the column names of data specified by cols.
    names_to = "features",    
    # A string specifying the name of the column to create from the data stored in cell values.
    values_to = "values"
  )

At least three ways of writing the code are possible.

Way 1

Code
# specifying the columns to pivot
qual.features.long <- qual.features %>% 
  pivot_longer(cols = c(gender,diagnosis,treatment_group,treatment_response),
               names_to = "features", 
               values_to = "values")

Way 2

Code
# specifying the columns to pivot
qual.features.long <- qual.features %>% 
  pivot_longer(cols = !smoking_status,
               names_to = "features", 
               values_to = "values")

Way 3 (tricky)

Code
# relocate smoking_status at the first column and spread from column gender to treatment_response
qual.features.long <- qual.features %>% 
                            relocate(smoking_status,.before = gender) %>% 
                            pivot_longer(cols=gender:treatment_response,
                                         names_to = "features", 
                                         values_to = "values")

You can note that the number of rows after gathering is equals to the number of rows in the original data (10000) times the number of gathered columns (4) reflecting all the combinations of smoking_status and all 4 features for each patient.

iii) We will plot the proportion of smokers (opacity) in each category value, using faceting to wrap the other features.

Key words to write on your favorite browser

R wrap facet ggplot2

Code
ggplot(qual.features.long,aes(x=values)) + 
  theme_bw() +
  geom_bar(aes(alpha=smoking_status, color=values, fill=values), position="fill") +
  facet_wrap(~features,scales="free")
Warning: Using alpha for a discrete variable is not advised.

Correlation plot

Using dplyr functions, select columns containing numeric values in clinical. Caution, the patient_id column contains numeric and we do not want it, we need to combine 2 conditions.

Code
quant.features <- clinical %>% select(where(is.numeric) & !patient_id) 

You have a very nice introduction to this package: here. All the information are in this page but don’t hesitate to search for more.

You want to obtain the following plot:

Here is the solution:

Code
# install the package that is in CRAN repository
install.packages("corrplot")
# load the library
library(corrplot)
# compute the correlation, by default it compute the correlation between columns
M <- cor(quant.features) 
# the plot !
corrplot(M, 
         method = 'square',
         addCoef.col = 'black',
         order = 'AOE',
         type = 'upper',
         diag = F,
         cl.pos = 'b')

Exercise

You first need to load the data.

Follow the procedure using the interface to load the file or try to use the function read.table.

Clinical data: TCGA_LUAD_subset.tsv, name the variable luad_clinical

Expression data : TCGA_LUAD_expression.tsv, name the variable luad_expr

Differential Expression Gene in KRAS tumors between high and low MAPK signature : TCGA_KRAS_DEG_MAPK.tsv, name the variable KRAS_DEG

# or you can use R command
luad_clinical <- read.table("datasets/TCGA_LUAD_subset.tsv",sep="\t",head=T)
luad_expr <- read.table("datasets/TCGA_LUAD_expression.tsv",sep="\t",head=T,row=1)
KRAS_DEG <- read.table("datasets/TCGA_KRAS_DEG_MAPK.tsv",sep="\t",head=T)

Questions

All answers are plots with sometimes some preliminary data formating

The answers we provide are not the unique solutions rather just examples, you can find multiple ways to show tendencies in your data.

Here are some colors you could use …
# colours for oncogenes
palette.known.onco <- c(ALK="darkgrey",
                        BRAF_act=colours()[613],
                        BRAF.non=colours()[11],
                        EGFR=colours()[128],
                        ERBB2=colours()[76],
                        MAP2K1=colours()[509],
                        MET=colours()[121],
                        NF1=colours()[468],
                        None=colours()[92],
                        RIT1=colours()[642],
                        ROS1=colours()[34],
                        KRAS="black")
  1. Are the oncogenes (column known.oncogenes) evenly represented in the dataset ?
Code
ggplot(data = luad_clinical, aes(x=known.oncogenes)) +
  geom_bar(aes(fill=known.oncogenes)) + 
  labs(x="Known Oncogenes") +
  theme_bw() +
  scale_fill_manual(values=palette.known.onco) +
  coord_flip() +
  theme(aspect.ratio = 2)
  1. Are there differences in MAPK activity (column signature_MAPK) in tumors depending on the mutated oncogene they are harboring (column known.oncogenes) ?

Solution 1

Code
ggplot(data = luad_clinical, aes(x=known.oncogenes, y=signature_MAPK)) +
  geom_boxplot(aes(color=known.oncogenes)) + 
  labs(x="Known Oncogenes", y="MAPK activity signature") +
  theme_bw() +
  scale_colour_manual(values=palette.known.onco) +
  coord_flip() +
  geom_jitter(aes(color=known.oncogenes)) +
  geom_hline(yintercept = median(luad_clinical$signature_MAPK), linewidth=0.3, linetype="dashed")

Solution 2

This solution implies a preliminary step during which you need to transform the data by summarizing the values by known oncogenes.

Code
mapk_activity_per_oncogenes <- luad_clinical %>% group_by(known.oncogenes) %>% 
          summarize(mapk_sd_low=mean(signature_MAPK)-sd(signature_MAPK),
                    mapk_mean=mean(signature_MAPK),
                    mapk_sd_high=mean(signature_MAPK)+sd(signature_MAPK))

ggplot(data = mapk_activity_per_oncogenes, aes(x=known.oncogenes, y=mapk_mean)) +
  geom_col(aes(fill=known.oncogenes, color=known.oncogenes),alpha=0.5) + 
  geom_errorbar(aes(ymin=mapk_sd_low,ymax=mapk_sd_high, color=known.oncogenes),width = 0.25) +
  labs(x="Known Oncogenes", y="Average MAPK activity signature") +
  theme_bw() +
  scale_fill_manual(values=palette.known.onco) +
  scale_colour_manual(values=palette.known.onco) +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))
  1. Researchers are particularly interested in some regulators of the MAPK signaling, among them we focus on DUSP4, DUSP6, SPRY2. How the expression of these genes globally correlates with the MAPK signature ?

3.1. Select the rows corresponding the phosphatase genes in the expression matrix, transpose the table, change its type to data.frame and add a column named sampleID containing the row names.

Code
phospho.expression <- luad_expr[c("DUSP4","DUSP6","SPRY2"),]
Code
phospho.expression <- t(phospho.expression)
Code
phospho.expression <- as.data.frame(phospho.expression)
Code
phospho.expression$sampleID <- row.names(phospho.expression)

3.2. Associate the expression of the genes to the clinical table and select the columns of interest.

Code
luad_clinical <- merge(luad_clinical,phospho.expression, by="sampleID")
Code
luad_clinical.genes <- luad_clinical %>% select(starts_with("DUSP"),SPRY2,signature_MAPK)

3.3. Compute and display the correlation between gene expression and MAPK activity.

Code
M <- cor(luad_clinical.genes)
corrplot(M,type = "upper", diag = F)
  1. Are these tendencies shared by the 4 most represented oncogenes ?
Code
nb.per.onco <- luad_clinical %>%
  count(known.oncogenes, name = "nb") %>%
  slice_max(nb, n = 4)

luad_clinical.genes.select <- luad_clinical %>% pivot_longer(cols = c(starts_with("DUSP"), SPRY2),names_to = "SYMBOL",values_to = "expression") %>% filter(known.oncogenes%in%nb.per.onco$known.oncogenes)

# If needed, load the package ggpubr
library(ggpubr)

ggscatter(luad_clinical.genes.select, x="expression", y="signature_MAPK", size = 0.5, add="reg.line", color = "known.oncogenes") +
  theme_light() +
  theme(legend.position = "right") +
  facet_grid(SYMBOL~known.oncogenes) +
  scale_colour_manual(values=palette.known.onco) +
  stat_cor(label.y = 1) +
  theme(aspect.ratio = 1)
  1. Which genes are the most significantly (top 10) associated with MAPK signature in KRAS patients ? (You need to load KRAS dataset)
Plot idea

You could display the genes in a volcano plot and add labels to the top 10 genes.

Code
top_10 <- KRAS_DEG %>% slice_min(P.adjusted,n=10)

KRAS_DEG <- KRAS_DEG %>% mutate(sig=ifelse(P.adjusted<0.01 & abs(logFC)>1, "significant","no"))

# If needed, load the ggrepel package
library(ggrepel)

ggplot(KRAS_DEG, aes(x=logFC, y=-log10(P.adjusted))) + theme_bw() +
  geom_point(size=0.6, aes(color=sig)) +
  geom_hline(yintercept = 2, linewidth=0.5, linetype="dashed") +
  geom_vline(xintercept = c(-1,0,1), linewidth=0.5, linetype="dashed") +
  geom_text_repel(data=top_10, aes(label=ID)) +
  scale_colour_manual(values=c(significant="red", none="darkgrey")) +
  labs(x="Log Fold Change") +
  theme(aspect.ratio = 1)

Bonus: Heatmap

In this section, we want to visualise the gene expression across all samples using a heatmap and highlight potential expression patterns using sample and gene clustering.

What is a heatmap ?

In a heatmap, each row usually represents a gene. Each column represents a sample. The rectangles containing the values are also called “cells”.

The colour and intensity of the cells is used to represent values of gene expression. So, basically, instead of numbers, we use colours. The colour will be more intense the lower (or higher) the value is.

To see patterns, we cluster (meaningful reordering) the rows and columns. This just means we group the samples and/or genes together based on the similarity of their gene expression pattern. The dendograms on the sides just indicate the results of clustering both genes and samples.

A clustering algorithm will group genes with a similar expression across samples together: groups of genes that are highly expressed in some samples, and lowly expressed in other samples will be clustered together. In a similar way, samples with a similar expression pattern (in general having certain sets of genes highly expressed and certain sets of genes lowly expressed) will be clustered together.

In summary, adding clustering to heatmaps can be useful for identifying genes that are commonly regulated, or biological signatures associated with a particular condition (for example, high MAPK activity).

Step by step

Before starting …

i) We need to find a package that allow us to draw (pretty) annotated heatmaps. Lets search on the web what we could try.

Key words to write on your favorite browser

R How to draw annotated heatmap

The package we will be using

pheatmap

ii) Install the package.

Code
install.packages("pheatmap")
library(pheatmap)

Step 1 : Prepare the data

To be able to annotate the heatmap we need to build the annotations for the samples (luad_clinical) and the genes of interest (DUSP4, DUSP6 and SPRY2) and associate colours to the categories for each feature.

i) Annotation of the samples

Using the variable clinical, extract the the columns known.oncogenes, pathologic_stage, gender, signature_group in a new variable anno.columns.

Code
gene.of.interest <- c("DUSP6", "SPRY2","DUSP4")
anno.columns <- luad_clinical[,c("known.oncogenes","pathologic_stage","gender")]
head(anno.columns)
  known.oncogenes pathologic_stage gender
1            KRAS         Stage IB   MALE
2            KRAS       Stage IIIA FEMALE
3            RIT1       Stage IIIA   MALE
4        BRAF_act         Stage IA   MALE
5            KRAS         Stage IB FEMALE
6            KRAS       Stage IIIB   MALE

To allow the tool to map the annotations to the samples in the expression dataset, we need to name the row with the sample identifiers that are stored in the column sampleID of clinical.

Code
row.names(anno.columns) <- luad_clinical$sampleID
head(anno.columns)
                known.oncogenes pathologic_stage gender
TCGA.05.4249.01            KRAS         Stage IB   MALE
TCGA.05.4250.01            KRAS       Stage IIIA FEMALE
TCGA.05.4384.01            RIT1       Stage IIIA   MALE
TCGA.05.4389.01        BRAF_act         Stage IA   MALE
TCGA.05.4390.01            KRAS         Stage IB FEMALE
TCGA.05.4395.01            KRAS       Stage IIIB   MALE

To define the colours for each feature, you need to build a list using the constructor function list() in which each element will be a named vector of colors as you have already made to create palette.known.onco. For that you have to know all possible categories for each feature. table() is an easy and very useful function to display all categories with the number of sample per category.

table(anno.columns$gender)

FEMALE   MALE 
   122    124 
table(anno.columns$pathologic_stage)

   Stage I   Stage IA   Stage IB  Stage IIA  Stage IIB Stage IIIA Stage IIIB 
         3         58         68         21         36         40          6 
  Stage IV 
        14 
table(anno.columns$pathologic_stage)

   Stage I   Stage IA   Stage IB  Stage IIA  Stage IIB Stage IIIA Stage IIIB 
         3         58         68         21         36         40          6 
  Stage IV 
        14 

etc

You can see that for the pathological stages, there are subcategories A and B that we want to regroup in order to simplify the visualisation. To group the sub-categories, we will remove the A and B from the end of the words using the function sub(). Try to write the code, many answers are possible.

Here are 2 possibilities:

anno.columns$pathologic_stage <- sub("A","",anno.columns$pathologic_stage)
anno.columns$pathologic_stage <- sub("B","",anno.columns$pathologic_stage)
# The cleanest way of coding this task is
anno.columns$pathologic_stage <- sub("[AB]$","",anno.columns$pathologic_stage)
# [AB] indicates that at the last position (sign $) you encounter either A or B
table(anno.columns$pathologic_stage)

  Stage I  Stage II Stage III  Stage IV 
      129        57        46        14 

ii) Annotations of the genes

We want to identify the genes belonging to the MAPK activity signature. To this end, we create a data.frame using the constructor function data.frame() with the row names corresponding to the gene names and containing one column in.signature containing yes or no.

Create a vector containing “yes” or “no” depending on the gene presence in the signature. Many possibilities exist but the best way is to use the function ifelse(). This function takes 3 arguments, an expression comparing the gene names (row names of the table luad_expr) with the genes of interest (gene.of.interest), what the function must return if the comparison is TRUE (the gene is in the signature, “yes”), what the function must return if the comparison is FALSE (the gene is not in the signature, “no”).

To know if the elements of a vector are present in an other vector we use the operator %in%.

genes.in.signature <- row.names(luad_expr) %in% gene.of.interest

Try to write the code that allow to obtain the yes/no vector using ifelse() and genes.in.signature and keep the results in a variable called in.signature.

in.signature <- ifelse(genes.in.signature,"yes","no")

Now you can create the data.frame.

Code
# Create annotation for the rows containing the genes: 
# we want to indcate the genes involved in the signature
anno.rows <- data.frame(row.names=row.names(luad_expr),in.signature)
head(anno.rows)
       in.signature
HLF              no
SIDT1            no
KCNK5            no
NT5E             no
ANKFN1           no
OSR2             no

Create a list colors.anno, specifying the colors of each category of each feature

# create a list containing the colors for each feature
colors.anno <- list(known.oncogenes=palette.known.onco,
                    pathologic_stage=c("Stage I"=colours()[404], 
                                       "Stage II"=colours()[33], 
                                       "Stage III"=colours()[35], 
                                       "Stage IV"=colours()[24]),
                    gender=c(FEMALE=colours()[613], MALE=colours()[290]),
                    in.signature=c(yes="black",no="white"))

Step 2 : Create a clustered heatmap

Display the heatmap with default parameters (do not change any parameters) using the function pheatmap() and the expression object luad_expr.

Code
pheatmap(luad_expr)

By default, pheatmap applies hierarchical clustering to both rows and columns. The parameter names are quite explicit.

By looking at the help of the function (?pheatmap) or online, identify the parameters and their default values that:

  • Enable row or column clustering (2 parameters)
Answer
cluster_cols = TRUE
cluster_rows = TRUE
  • specify the statistics to compute the distances between rows or columns (2 parameters)
Answer
clustering_distance_rows = "euclidean"
clustering_distance_cols = "euclidean"
  • specify how to aggregate the rows and columns based on their distance (1 parameters)
Answer
clustering_method = "complete"

Step 2 : Customize your heatmap

By analysing the default heatmap, we can notice that column labels are not readable and are not adding relevant information, we want to remove them, the row names are readable but we want to decrease the size of the font. We will test different method for clustering the rows and columns. We would like to change the colors of the cells to make them more divergent using for instance the color palette from viridis and we want to intensify the differences by lowering the range value amplitude (between -4 and 4), meaning that values outside of this range will display the extreme colors. We then want to annotate the samples and genes with nice colors, separate the clusters and finally add a tittle.

i) Row and column names

By looking at the help of the function (?pheatmap) or online, identify the parameters (and adapt their values) that:

  • Enable the display of the column names and disable it.
Answer
show_colnames = T # default
  • Enable to decrease the font size
Answer
fontsize = 10 # default
Code
pheatmap(luad_expr,
         cluster_rows = T, cluster_cols = T, # set to FALSE if you want to remove the dendograms
         show_colnames = F, # displaying column names
         show_rownames = T, # displaying row names
         fontsize = 8 # change font size
         )

ii) Change the colors

By looking at the help of the function (?pheatmap) or online, identify the parameters (and adapt their values) that:

  • Enable to set the color palette. In the example we used the inferno palette from the package viridis and asked for 100 colors.
Answer
color = colorRampPalette(rev(brewer.pal(n = 7, name = "RdYlBu")))(100) # Default
  • Enable to change the amplitude of the colors between -4 and 4
Answer
breaks = NA # default
Code
pheatmap(luad_expr,
         cluster_rows = T, cluster_cols = T, # set to FALSE if you want to remove the dendograms
         show_colnames = F, # displaying column names
         show_rownames = T, # displaying row names
         color = viridis::inferno(100), # choose a colour scale for your data
         breaks = seq(-4,4,0.08), # seq produce a vector from -4 to 4 with a step of 0.08 (100 values)
         fontsize = 8 # change font size
         )

iii) Choose the clustering methods

Using the parameters you identified earlier for clustering methods, test several combinations to observe the impact on the results. You can find bellow the definition of the different methods. In the example, we chose to cluster following the correlation distance and the ward.D agglomeration methods.

Distance methods

Euclidean Distance: This is like drawing a straight line between two points. It’s the most intuitive way to measure distance, like the shortest path “as the crow flies” between two locations on a map.

Manhattan Distance: Think of how you would walk in a city laid out like a grid, where you can only move horizontally or vertically. The Manhattan distance measures the total number of blocks you would walk.

Correlation Distance: Instead of focusing on actual distance, this measures how similar the patterns or trends are between points. It’s useful when you care about how things change together rather than how far apart they are.

Maximum Distance (Chebyshev): This only looks at the largest difference in any single direction (horizontal or vertical). Imagine trying to figure out how far two buildings are by only checking the longest side of the blocks between them.

Canberra Distance: This one gives more weight to differences when values are small. It’s useful when small differences matter a lot in your data.

Binary Distance: This is used for yes/no or presence/absence type data. It simply compares if two points have the same “yes/no” values.

Minkowski Distance: This is a general form of both Euclidean and Manhattan distances. By changing a parameter, it can behave like either Euclidean (straight line) or Manhattan (grid-based).

Agglomeration methods

Ward’s Method (ward.D, ward.D2): This tries to form groups that minimize the difference within each cluster. It prefers clusters where data points are very similar to each other. It’s great for balanced, compact groups.

Single Linkage (single): This method focuses on the smallest distance between points in two clusters. It can create “chains” of points that are loosely connected, leading to long, stretched-out clusters.

Complete Linkage (complete): This method uses the largest distance between points in two clusters. It prefers tight, compact clusters, and avoids the stretched-out effect of single linkage.

Average Linkage (average/UPGMA): Here, we average the distances between all pairs of points in two clusters. It balances between tight and loose clusters, forming more moderate-sized groups.

McQuitty’s Method (mcquitty/WPGMA): Similar to average linkage, but it calculates the average in a slightly different way. It’s another method to balance between close and distant points.

Median Method (median/WPGMC): This method looks at the median distances between points, creating balanced clusters that can handle uneven data well.

Centroid Method (centroid/UPGMC): This method calculates the “center” of clusters (think of the average location of all points in the cluster) and uses that to decide how to group them.

Code
pheatmap(luad_expr,
         cluster_rows = T, cluster_cols = T, # set to FALSE if you want to remove the dendograms
         show_colnames = F, # displaying column names
         show_rownames = T, # displaying row names
         clustering_distance_cols = 'correlation',
         clustering_distance_rows = 'correlation',
         clustering_method = 'ward.D', # how to aggregate rows and columns depending on their distance
         color = viridis::inferno(100), # choose a colour scale for your data
         breaks = seq(-4,4,0.08), # seq produce a vector from -4 to 4 with a step of 0.08 (100 values)
         fontsize = 8 # change font size
         )

iv) Separate clusters

By looking at the help of the function (?pheatmap) or online, identify the parameters (and adapt their values) that:

  • Enable to show distinct clusters, in the example, 6 sample clusters and 4 gene clusters (2 parameters)
Answer
 cutree_rows = NA # Default
 cutree_cols = NA # Default
Code
pheatmap(luad_expr,
         cluster_rows = T, cluster_cols = T, # set to FALSE if you want to remove the dendograms
         show_colnames = F, # displaying column names
         show_rownames = T, # displaying row names
         clustering_distance_cols = 'correlation',
         clustering_distance_rows = 'correlation',
         annotation_row = anno.rows, # row (gene) annotations
         annotation_col = anno.columns, # column (sample) annotations
         annotation_colors = colors.anno, # colours for your annotations
         clustering_method = 'ward.D', # how to aggregate rows and columns depending on their distance
         color = viridis::inferno(100), # choose a colour scale for your data
         breaks = seq(-4,4,0.08),
         cutree_cols = 6, # display column clusters
         cutree_rows = 4, # display row clusters
         fontsize = 8 # change font size
         )

---